Abstract

We use optimism to introduce generic asymptotically optimal rein- 
forcement learning agents. They achieve, with an arbitrary finite or com- 
pact class of environments, asymptotically optimal behavior. Further- 
more, in the finite deterministic case we provide finite error bounds. 
1 Introduction

This article studies a fundamental question in artificial intelligence: given a set
of environments, how do we define an agent that eventually acts optimally
regardless of which of the environments it is in? This question relates to the even
more fundamental question of what intelligence is. [Hut05] defines an intelligent
agent as one that can act well in a large range of environments. He studies
arbitrary classes of environments with particular attention to universal classes
of environments like all computable (deterministic) environments and all lower
semi-computable (stochastic) environments. He defines the AIXI agent as a
Bayesian reinforcement learning agent with a universal hypothesis class and a
Solomonoff prior. This agent has some interesting optimality properties. Besides
maximizing expected utility with respect to the a priori distribution by
design, it is also Pareto optimal and self-optimizing when this is possible for the
considered class. It was, however, shown in [Ors10] that it is not guaranteed
to be asymptotically optimal for all computable (deterministic) environments.
[LH11a] shows that this is not surprising since, at least for geometric
discounting, no agent can be. [LH11a] also shows that in a weaker (in average) sense,
optimality can be achieved for the class of all computable environments using
an algorithm that includes long exploration phases. Furthermore, it is simple to
see that Bayesian agents do not always achieve optimality for a finite class
of deterministic environments even if all prior weights are strictly positive.

We use the principle of optimism to define an agent that for any finite class 
of deterministic environments, eventually acts optimally. We extend our results 
to the case of finite and compact classes of stochastic environments. In the 
deterministic case we also prove finite error bounds. Optimism has previously 
been used to design exploration strategies for both discounted and undiscounted 
MDPs [KS98, SL05, AO06, LH12], though here we define optimistic algorithms
for any finite class of environments.

Related work. Besides AIXI [Hut05] that was discussed above, [LH11a]
introduces an agent which achieves asymptotic optimality in an average sense
for the class of all deterministic computable environments. There is, however,
no time step after which it is optimal at every time step. This is due to an
infinite number of long exploration phases. We introduce an agent that, for
finite classes of environments, does eventually achieve optimality for every time
step. For the stochastic case, the agent achieves, with any given probability,
optimality within $\varepsilon$ for any $\varepsilon > 0$. Our very simple agent relies
on the principle of optimism, used previously in the restrictive MDP case with
discounting [KS98, SL05, LH12] and without [AO06], instead of an indefinite
number of explicitly enforced bursts of exploration. [RH08] also introduces an
agent that relies on bursts of exploration with the aim of achieving asymptotic
optimality. The asymptotic optimality guarantees are restricted to a setting
where all environments satisfy a certain restrictive value-preservation property.
[EDKM05] studied learning general Partially Observable Markov Decision
Processes (POMDPs). Though POMDPs constitute a very general reinforcement
learning setting, we are interested in agents that can be given any (deterministic
or stochastic) class of environments and successfully utilize the knowledge that
the true environment lies in this class.

Background. We will consider an agent [RN10, Hut05] that interacts with
an environment through performing actions $a_t$ from a finite set $\mathcal{A}$ and receives
observations $o_t$ from a finite set $\mathcal{O}$ and rewards $r_t$ from a finite set $\mathcal{R} \subset [0, 1]$.
Let $\mathcal{H} := (\mathcal{A} \times \mathcal{O} \times \mathcal{R})^*$ be the set of histories and $R : \mathcal{H} \to \mathbb{R}$ the return

\[ R(h_t) := \sum_{i=1}^{t} \gamma^{i-1} r_i \]

with the obvious extension to infinite sequences. A function from $\mathcal{H} \times \mathcal{A}$ to
$\mathcal{O} \times \mathcal{R}$ is called a deterministic environment (studied in Section 2). A function
$\pi : \mathcal{H} \to \mathcal{A}$ is called a policy or an agent. We define the value function $V$
by $V^\pi_\nu(h_{t-1}) := R(h_{t:\infty}) = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$ where the sequence $r_i$ are the rewards
achieved by following $\pi$ from time step $t$ onwards in environment $\nu$ after having
seen $h_{t-1}$.
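
As a small illustration of the return defined above, the following Python sketch computes the discounted sum of a finite reward sequence, weighting the first reward by 1 and each later reward by one more factor of the discount. The function name `discounted_return` and the example numbers are hypothetical and not part of the original text.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_1 + gamma*r_2 + gamma^2*r_3 + ... of a finite reward list.

    Hypothetical helper for illustration; rewards are assumed to lie in [0, 1]
    and gamma in (0, 1), as in the setting described in the text.
    """
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


example = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1 + 0.5 + 0.25 = 1.75
upper_bound = 1.0 / (1.0 - 0.5)                     # returns never exceed 1/(1-gamma)
```

Since rewards lie in $[0, 1]$, every return is bounded by $1/(1-\gamma)$, which is the bound used repeatedly in the later proofs.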

Instead of viewing the environment as a function from $\mathcal{H} \times \mathcal{A}$ to $\mathcal{O} \times \mathcal{R}$ we
can equivalently write it as a function $\nu : \mathcal{H} \times \mathcal{A} \times \mathcal{O} \times \mathcal{R} \to \{0, 1\}$ where we
write $\nu(o, r | h, a)$ for the function value of $(h, a, o, r)$. It equals zero if in the first
formulation $(h, a)$ is not sent to $(o, r)$ and 1 if it is. In the case of stochastic
environments, which we will study in Section 3, we instead have a function
$\nu : \mathcal{H} \times \mathcal{A} \times \mathcal{O} \times \mathcal{R} \to [0, 1]$ such that $\sum_{o,r} \nu(o, r | h, a) = 1$ $\forall h, a$. Furthermore,
we define $\nu(h_t | \pi) := \nu(or_{1:t} | \pi) := \prod_{i=1}^{t} \nu(o_i r_i | a_i, h_{i-1})$ where $a_i = \pi(h_{i-1})$.
$\nu(\cdot | \pi)$ is a probability measure over strings or sequences as will be discussed in
the next section and we can define $\nu(\cdot | \pi, h_{t-1})$ by conditioning on $h_{t-1}$.

We define $V^\pi_\nu(h_{t-1}) := \mathbb{E}_{\nu(\cdot | \pi, h_{t-1})} R(h_{t:\infty})$ as the $\nu$-expected return of policy $\pi$.

A special case of an environment is a Markov Decision Process (MDP)
[SB98]. This is the classical setting for reinforcement learning. In this case
the environment does not depend on the full history but only on the latest
observation and action and is, therefore, a function from $\mathcal{O} \times \mathcal{A} \times \mathcal{O} \times \mathcal{R}$ to $[0, 1]$.
In this situation one often refers to the observations as states since the latest
observation tells us everything we need to know. In this situation, there is an
optimal policy that can be represented as a function from the state set $\mathcal{S}$ ($:= \mathcal{O}$)
to $\mathcal{A}$. We only need to base our decision on the latest observation. Several
algorithms [KS98, SL05, LH12] have been devised for solving discounted ($\gamma < 1$)
MDPs for which one can prove PAC (Probably Approximately Correct) bounds.
They are finite time bounds that hold with high probability and depend only
polynomially on the number of states, actions and the discount factor. These
methods rely on optimism as the method for making the agent sufficiently
explorative. Optimism roughly means that one has high expectations for what
one does not yet know. Optimism was also used to prove regret bounds for
undiscounted ($\gamma = 1$) MDPs in [AO06], which was extended to feature MDPs
in [MMR11]. Note that these methods are restricted to MDPs and that we
do not make any (Markov, ergodicity, stationarity, etc.) assumptions on the
environments, only on the size of the class.

Outline. In this article we will define optimistic agents in a far more general
setting than MDPs and prove asymptotic optimality results. The question of
their mere existence is already non-trivial, hence asymptotic results deserve
attention. In Section 2 we consider finite classes of deterministic environments
and introduce a simple optimistic agent that is guaranteed to eventually act
optimally. We also provide finite error bounds. In Section 3 we generalize to
finite classes of stochastic environments and in Section 4 to compact classes.

2 Finite Classes of Deterministic Environments 

Given a finite class of deterministic environments $\mathcal{M} = \{\nu_1, \dots, \nu_m\}$, we define
an algorithm that for any unknown environment from $\mathcal{M}$ eventually achieves
optimal behavior in the sense that there exists $T$ such that maximum reward is
achieved from time $T$ onwards. The algorithm chooses an optimistic hypothesis
from $\mathcal{M}$ in the sense that it picks the environment in which one can achieve the
highest reward (in case of a tie, choose the environment which comes first in an
enumeration of $\mathcal{M}$) and then the policy that is optimal for this environment is
followed. If this hypothesis is contradicted by the feedback from the environment,
a new optimistic hypothesis is picked from the environments that are still
consistent with $h$. This technique has the important consequence that if the
hypothesis is not contradicted we are still acting optimally when optimizing for
this incorrect hypothesis.

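Require: Finite class of deterministic environments $\mathcal{M}_0 = \mathcal{M}$
$t \leftarrow 1$
repeat
    $(\pi^*, \nu^*) \in \arg\max_{\pi \in \Pi,\, \nu \in \mathcal{M}_{t-1}} V^\pi_\nu(h_{t-1})$
    repeat
        $a_t = \pi^*(h_{t-1})$
        Perceive $o_t r_t$ from environment $\mu$
        $h_t \leftarrow h_{t-1} a_t o_t r_t$
        Remove all inconsistent environments from $\mathcal{M}_t$
            ($\mathcal{M}_t := \{\nu \in \mathcal{M}_{t-1} : h_t^{\pi^\circ,\nu} = h_t\}$)
        $t \leftarrow t + 1$
    until $\nu^* \notin \mathcal{M}_{t-1}$
until $\mathcal{M}$ is empty

Algorithm 1: Optimistic Agent ($\pi^\circ$) for Deterministic Environments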

Let $h_t^{\pi,\nu}$ be the history up to time $t$ generated by policy $\pi$ in environment
$\nu$. In particular let $h^\circ := h^{\pi^\circ,\mu}$ be the history generated by Algorithm 1 (policy
$\pi^\circ$) interacting with the actual "true" environment $\mu$. At the end of cycle $t$ we
know $h^\circ_t = h_t$. An environment $\nu$ is called consistent with $h_t$ if $h_t^{\pi^\circ,\nu} = h_t$.

Let $\mathcal{M}_t$ be the environments consistent with $h_t$. The algorithm only needs to
check whether $o_t^{\pi^\circ,\nu} = o_t$ and $r_t^{\pi^\circ,\nu} = r_t$ for each $\nu \in \mathcal{M}_{t-1}$, since previous
cycles ensure $h_{t-1}^{\pi^\circ,\nu} = h_{t-1}$ and trivially $a_t^{\pi^\circ,\nu} = a_t$. The maximization in
Algorithm 1 that defines optimism at time $t$ is performed over all $\nu \in \mathcal{M}_{t-1}$, the
set of consistent hypotheses at time $t$, and over all $\pi \in \Pi$, where $\Pi$ is the class of all
deterministic policies.
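
The conservative optimistic agent described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: environments are modeled as hypothetical functions `(history, action) -> (observation, reward)`, the infinite-horizon value is replaced by a brute-force search over action sequences of a fixed `horizon`, and the action is re-planned inside the current hypothesis at every step. The names `truncated_value`, `optimistic_agent`, `mu` and `nu` are invented for this example, and the true environment is assumed to be in the class.

```python
from itertools import product


def truncated_value(env, history, actions, gamma, horizon):
    """Best truncated discounted return in `env` after `history`, and a first action achieving it."""
    best_value, best_action = float("-inf"), None
    for plan in product(actions, repeat=horizon):
        h, value, discount = history, 0.0, 1.0
        for a in plan:
            o, r = env(h, a)
            value += discount * r
            discount *= gamma
            h = h + ((a, o, r),)
        if value > best_value:  # strict: ties keep the earlier plan
            best_value, best_action = value, plan[0]
    return best_value, best_action


def optimistic_agent(envs, true_env, actions, gamma=0.9, horizon=3, steps=6):
    """Follow the most optimistic consistent hypothesis until it is contradicted."""
    consistent = list(envs)  # plays the role of M_0
    history = ()
    hypothesis = None
    while len(history) < steps:
        if hypothesis not in consistent:
            # Optimism: pick the environment with the highest achievable value.
            # max() returns the first maximizer, i.e. ties go to the enumeration order.
            hypothesis = max(
                consistent,
                key=lambda e: truncated_value(e, history, actions, gamma, horizon)[0],
            )
        _, action = truncated_value(hypothesis, history, actions, gamma, horizon)
        obs, reward = true_env(history, action)
        # Keep only environments that predicted the observed percept.
        consistent = [e for e in consistent if e(history, action) == (obs, reward)]
        history = history + ((action, obs, reward),)
    return history, consistent


def mu(history, action):  # hypothetical true environment
    return (0, 0.5 if action == 0 else 0.0)


def nu(history, action):  # hypothetical over-optimistic alternative
    return (0, 0.5 if action == 0 else 1.0)


history, remaining = optimistic_agent([nu, mu], mu, actions=(0, 1))
rewards = [r for (_, _, r) in history]
```

In this toy run the agent first trusts `nu`, tries action 1, observes a reward that contradicts `nu`, removes it, and from then on acts optimally for `mu`. The single suboptimal step corresponds to the one inconsistency point that the class allows.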

Theorem 1 (Optimality, Finite Deterministic Class). If we use Algorithm 1
($\pi^\circ$) in an environment $\mu \in \mathcal{M}$, then there is $T < \infty$ such that

\[ V^{\pi^\circ}_\mu(h_t) = \max_{\pi} V^\pi_\mu(h_t) \quad \forall t \ge T. \]

A key to proving Theorem 1 is time-consistency [LH11b] of geometric
discounting. The following lemma tells us that if we act optimally with respect to
a chosen optimistic hypothesis, it remains optimistic until contradicted.

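Lemma 2 (Time-consistency). Suppose $(\pi^*, \nu^*) \in \arg\max_{\pi \in \Pi,\, \nu \in \mathcal{M}_t} V^\pi_\nu(h_t)$,
that we act according to $\pi^*$ from time $t$ to time $\tilde{t} - 1$ and that $\nu^*$ is still consistent
at time $\tilde{t} > t$, then $(\pi^*, \nu^*) \in \arg\max_{\pi \in \Pi,\, \nu \in \mathcal{M}_{\tilde{t}}} V^\pi_\nu(h_{\tilde{t}})$.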

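Proof. Suppose that $V^{\pi^*}_{\nu^*}(h_{\tilde{t}}) < V^{\tilde{\pi}}_{\tilde{\nu}}(h_{\tilde{t}})$ for some $\tilde{\pi}, \tilde{\nu}$. It holds that $V^{\pi^*}_{\nu^*}(h_t) =
C + \gamma^{\tilde{t}-t} V^{\pi^*}_{\nu^*}(h_{\tilde{t}})$ where $C$ is the accumulated reward between $t$ and $\tilde{t} - 1$. Let
$\hat{\pi}$ be a policy that equals $\pi^*$ from $t$ to $\tilde{t} - 1$ and then equals $\tilde{\pi}$. It follows that
$V^{\hat{\pi}}_{\tilde{\nu}}(h_t) = C + \gamma^{\tilde{t}-t} V^{\tilde{\pi}}_{\tilde{\nu}}(h_{\tilde{t}}) > C + \gamma^{\tilde{t}-t} V^{\pi^*}_{\nu^*}(h_{\tilde{t}}) = V^{\pi^*}_{\nu^*}(h_t)$ which contradicts the
assumption $(\pi^*, \nu^*) \in \arg\max_{\pi \in \Pi,\, \nu \in \mathcal{M}_t} V^\pi_\nu(h_t)$. Therefore, $V^{\pi^*}_{\nu^*}(h_{\tilde{t}}) \ge V^\pi_\nu(h_{\tilde{t}})$
for all $\pi, \nu$. ■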



Proof. (Theorem 1) At time $t$ we know $h_t$. If some $\nu \in \mathcal{M}_{t-1}$ is inconsistent
with $h_t$, i.e. $h_t^{\pi^\circ,\nu} \ne h_t$, it gets removed, i.e. is not in $\mathcal{M}_{t'}$ for all $t' \ge t$.

Since $\mathcal{M}_0 = \mathcal{M}$ is finite, such inconsistencies can only happen finitely often,
i.e. from some $T$ onwards we have $\mathcal{M}_t = \mathcal{M}_\infty$ for all $t \ge T$. Since $h_t^{\pi^\circ,\mu} = h_t$ $\forall t$,
we know that $\mu \in \mathcal{M}_t$ $\forall t$.

Assume $t \ge T$ henceforth. The optimistic hypothesis will not change after
this point. If the optimistic hypothesis is the true environment $\mu$, we have
obviously chosen the true optimal policy.

In general, the optimistic hypothesis $\nu^*$ is such that it will never be
contradicted while actions are taken according to $\pi^\circ$, hence $(\pi^*, \nu^*)$ do not change
anymore. This implies

\[ V^{\pi^\circ}_\mu(h_t) = V^{\pi^*}_\mu(h_t) = V^{\pi^*}_{\nu^*}(h_t) = \max_{\nu \in \mathcal{M}_t} \max_{\pi \in \Pi} V^\pi_\nu(h_t) \ge \max_{\pi \in \Pi} V^\pi_\mu(h_t) \]

for all $t \ge T$. The first equality follows from $\pi^\circ$ being equal to $\pi^*$ from $t \ge T$
onwards. The second equality follows from consistency of $\nu^*$ with $h^\circ_{1:\infty}$. The third
equality follows from optimism, the constancy of $\pi^*$, $\nu^*$, and $\mathcal{M}_t$ for $t \ge T$,
and time-consistency of geometric discounting (Lemma 2). The last inequality
follows from $\mu \in \mathcal{M}_t$. The reverse inequality $V^{\pi^*}_\mu(h_t) \le \max_\pi V^\pi_\mu(h_t)$ follows
from $\pi^* \in \Pi$. Therefore $\pi^\circ$ is acting optimally at all times $t \ge T$. ■



Besides the eventual optimality guarantee above, we also provide a bound
on the number of time steps for which the value of following Algorithm 1 is
more than a certain $\varepsilon > 0$ less than optimal. The reason this bound is true is
that we only have such suboptimality for a certain number of time steps before
a point where the current hypothesis becomes inconsistent and the number of
such inconsistency points is bounded by the number of environments.

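Theorem 3 (Finite error bound). Following $\pi^\circ$ (Algorithm 1),

\[ V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi \in \Pi} V^\pi_\mu(h_t) - \varepsilon, \quad 0 < \varepsilon < 1/(1-\gamma) \]

for all but at most $|\mathcal{M}| \frac{\log \varepsilon(1-\gamma)}{\gamma - 1}$ time steps $t$.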

Proof. Consider the $\ell$-truncated value

\[ V^\pi_{\nu,\ell}(h_t) := \sum_{i=t+1}^{t+\ell} \gamma^{i-t-1} r_i \]

where the sequence $r_i$ are the rewards achieved by following $\pi$ from time $t + 1$
to $t + \ell$ in $\nu$ after seeing $h_t$. By letting $\ell = \frac{\log \varepsilon(1-\gamma)}{\log \gamma}$ (which is positive due to
negativity of both numerator and denominator) we achieve

\[ |V^\pi_{\nu,\ell}(h_t) - V^\pi_\nu(h_t)| \le \frac{\gamma^\ell}{1-\gamma} = \varepsilon. \]

Let $(\pi^*_t, \nu^*_t)$ be the policy-environment pair selected by Algorithm 1
in cycle $t$.

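Let us first assume $h^{\pi^\circ,\nu^*_t}_{t+1:t+\ell} = h^{\pi^\circ,\mu}_{t+1:t+\ell}$, i.e. $\nu^*_t$ is consistent with $h^\circ_{t+1:t+\ell}$,
and hence $\pi^*_t$ and $\nu^*_t$ do not change from $t+1, \dots, t+\ell$ (inner loop of Algorithm
1). Then

\[ V^{\pi^\circ}_\mu(h_t) \ge V^{\pi^\circ}_{\mu,\ell}(h_t) = V^{\pi^\circ}_{\nu^*_t,\ell}(h_t) = V^{\pi^*_t}_{\nu^*_t,\ell}(h_t) \]

(drop terms; same $h_{t+1:t+\ell}$; $\pi^\circ = \pi^*_t$ on $h_{t+1:t+\ell}$)

\[ \ge V^{\pi^*_t}_{\nu^*_t}(h_t) - \frac{\gamma^\ell}{1-\gamma} = \max_{\nu \in \mathcal{M}_t} \max_{\pi \in \Pi} V^\pi_\nu(h_t) - \varepsilon \ge \max_{\pi \in \Pi} V^\pi_\mu(h_t) - \varepsilon \]

(bound extra terms; definition of $(\pi^*_t, \nu^*_t)$ and $\varepsilon := \frac{\gamma^\ell}{1-\gamma}$; $\mu \in \mathcal{M}_t$).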

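Now let $t_1, \dots, t_K$ be the times $t$ at which the currently selected $\nu^*_t$ gets
inconsistent with $h_t$, i.e. $\{t_1, \dots, t_K\} = \{t : \nu^*_t \notin \mathcal{M}_t\}$. Therefore
$h^{\pi^\circ,\nu^*_t}_{t+1:t+\ell} \ne h^\circ_{t+1:t+\ell}$ (only) at times $t \in \mathcal{T}_\times := \bigcup_{i=1}^{K} \{t_i - \ell, \dots, t_i - 1\}$,
which implies $V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi \in \Pi} V^\pi_\mu(h_t) - \varepsilon$ except possibly for $t \in \mathcal{T}_\times$. Finally

\[ |\mathcal{T}_\times| = \ell K \le \ell |\mathcal{M}| = \frac{\log \varepsilon(1-\gamma)}{\log \gamma} |\mathcal{M}| \le |\mathcal{M}| \frac{\log \varepsilon(1-\gamma)}{\gamma - 1}. \quad ■ \]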
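
To get a feel for the size of the bound, the following Python sketch evaluates the truncation horizon $\ell = \log(\varepsilon(1-\gamma)) / \log \gamma$ and the two resulting bounds on the number of $\varepsilon$-suboptimal time steps. The function names and the example values (5 environments, $\varepsilon = 0.1$, $\gamma = 0.9$) are hypothetical choices for illustration only.

```python
import math


def truncation_horizon(epsilon, gamma):
    """Horizon l with gamma**l / (1 - gamma) == epsilon, i.e. l = log(eps*(1-gamma)) / log(gamma)."""
    if not (0.0 < gamma < 1.0 and 0.0 < epsilon < 1.0 / (1.0 - gamma)):
        raise ValueError("need 0 < gamma < 1 and 0 < epsilon < 1/(1-gamma)")
    return math.log(epsilon * (1.0 - gamma)) / math.log(gamma)


def error_step_bounds(num_envs, epsilon, gamma):
    """Return (l, |M|*l, |M|*log(eps*(1-gamma))/(gamma-1)); the last value is the looser bound."""
    ell = truncation_horizon(epsilon, gamma)
    tight = num_envs * ell
    loose = num_envs * math.log(epsilon * (1.0 - gamma)) / (gamma - 1.0)
    return ell, tight, loose


ell, tight, loose = error_step_bounds(num_envs=5, epsilon=0.1, gamma=0.9)
```

The looser bound follows from $\log \gamma \le \gamma - 1 < 0$, so dividing the negative numerator by $\gamma - 1$ instead of $\log \gamma$ can only increase the quotient.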



We refer to the algorithm above as the conservative agent since it sticks to 
its model for as long as it can. The corresponding liberal agent reevaluates 
its optimistic hypothesis at every time step and can switch between different 
optimistic policies at any time. Algorithm 1 is actually a special case of this as
shown by Lemma 2. The liberal agent is really a class of algorithms and this
larger class of algorithms consists of exactly the algorithms that are optimistic 
at every time step without further restrictions. The conservative agent is the 
subclass of algorithms that only switch hypothesis when the previous is contra- 
dicted. The results for the conservative agent can be extended to the liberal 
one, but we have to omit that here for space reasons. 

3 Stochastic Environments 

A stochastic hypothesis may never become completely inconsistent in the sense
of assigning zero probability to the observed sequence while still assigning very
different probabilities than the true environment. Therefore, we exclude based
on a threshold for the probability assigned to the generated history. Unlike in
the deterministic case, a hypothesis can cease to be the optimistic one without
having been excluded. We, therefore, only consider an algorithm that
reevaluates its optimistic hypothesis at every time step. Algorithm 2 specifies the
procedure and Theorem 4 states that it is asymptotically optimal.

Require: Finite class of stochastic environments $\mathcal{M}_1 = \mathcal{M}$, threshold
    $z \in (0, 1)$
$t = 1$
repeat
    $(\pi^*, \nu^*) = \arg\max_{\pi \in \Pi,\, \nu \in \mathcal{M}_t} V^\pi_\nu(h_{t-1})$
    $a_t = \pi^*(h_{t-1})$
    Perceive $o_t r_t$ from environment $\mu$
    $h_t \leftarrow h_{t-1} a_t o_t r_t$
    $t \leftarrow t + 1$
    $\mathcal{M}_t := \{\nu \in \mathcal{M}_{t-1} : \frac{\nu(h_{t-1} | a_{1:t-1})}{\max_{\tilde{\nu} \in \mathcal{M}} \tilde{\nu}(h_{t-1} | a_{1:t-1})} > z\}$
until the end of time

Algorithm 2: Optimistic Agent ($\pi^\circ$) with Stochastic Finite Class
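
The threshold-based exclusion step can be illustrated with a short Python sketch. It is a simplified, hypothetical rendering: each environment is represented only by the log-likelihood it assigns to the history so far, the probability each environment assigns to the newest percept is passed in as `step_probs`, and likelihoods are compared against the best-scoring environment that is still being tracked. The name `update_and_prune` and the two example environments are assumptions made for this illustration.

```python
import math


def update_and_prune(log_lik, step_probs, z):
    """Update log-likelihoods with the newest percept and drop low-ratio environments.

    log_lik:    dict name -> log-probability assigned to the history so far
    step_probs: dict name -> probability assigned to the newest percept
    z:          exclusion threshold in (0, 1)
    """
    updated = {}
    for name, ll in log_lik.items():
        p = step_probs[name]
        updated[name] = ll + math.log(p) if p > 0.0 else float("-inf")
    best = max(updated.values())
    # Keep nu iff likelihood(nu) / best likelihood > z, compared in log space.
    return {n: ll for n, ll in updated.items() if ll - best > math.log(z)}


# Two hypothetical environments; the observed percept has probability 0.5 under
# "fair" and 0.1 under "biased" at every step.
probs = {"fair": 0.5, "biased": 0.1}
after_one = update_and_prune({"fair": 0.0, "biased": 0.0}, probs, z=0.1)
after_two = update_and_prune(after_one, probs, z=0.1)
```

After one step the likelihood ratio of `"biased"` to `"fair"` is 0.2, which is above the threshold 0.1, so both survive; after two steps it is 0.04 and `"biased"` is excluded.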

Theorem 4 (Optimality, Finite Stochastic Class). Define $\pi^\circ$ by using
Algorithm 2 with any threshold $z \in (0, 1)$ and a finite class $\mathcal{M}$ of stochastic
environments containing the true environment $\mu$, then with probability at least
$1 - z(|\mathcal{M}| - 1)$ there exists, for every $\varepsilon > 0$, a number $T < \infty$ such that

\[ V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi} V^\pi_\mu(h_t) - \varepsilon \quad \forall t \ge T. \]

We borrow some techniques from [Hut09] that introduced a "merging of
opinions" result that generalized the classical theorem by [BD62]. The classical
result says that it is sufficient that the true measure (over infinite sequences) is
absolutely continuous with respect to a chosen a priori distribution to guarantee
that they will almost surely merge in the sense of total variation distance. The
generalized version is given in Lemma 6. When we combine a policy $\pi$ with an
environment $\nu$ by letting the actions be taken by the policy, we have defined
a measure, denoted by $\nu(\cdot | \pi)$, on the space of infinite sequences from a finite
alphabet. We denote such a sample sequence by $\omega$ and the $a$:th to $b$:th elements
of $\omega$ by $\omega_{a:b}$. The $\sigma$-algebra is generated by the cylinder sets $\Gamma_{y_{1:t}} := \{\omega \mid \omega_{1:t} =
y_{1:t}\}$ and a measure is determined by its values on those sets. To simplify
notation in the next lemmas we will write $P(\cdot) = \nu(\cdot | \pi)$, meaning that $P(\omega_{1:t}) =
\nu(h_t | a_{1:t})$ where $\omega_j = o_j r_j$ and $a_j = \pi(h_{j-1})$. Furthermore, $\nu(\cdot | h_t, \pi) = P(\cdot | h_t)$.

Definition 5 (Total Variation Distance). The total variation distance between
two measures (on infinite sequences $\omega$ of elements from a finite alphabet) $P$ and
$Q$ is defined to be

\[ d(P, Q) = \sup_{A} |P(A) - Q(A)| \]

where $A$ is in the previously specified $\sigma$-algebra generated by the cylinder sets.

The results from [Hut09] are based on the fact that $Z_t = \frac{Q(\omega_{1:t})}{P(\omega_{1:t})}$ is a martingale
sequence if $P$ is the true measure and therefore converges with $P$ probability
1 [Doo53]. The crucial question is if the limit is strictly positive or not. The
following lemma shows that with $P$ probability 1 we are either in the case where
the limit is $0$ or in the case where $d(P(\cdot | \omega_{1:t}), Q(\cdot | \omega_{1:t})) \to 0$. We say that the
environments $\nu_1$ and $\nu_2$ merge under $\pi$ if $d(\nu_1(\cdot | h_t, \pi), \nu_2(\cdot | h_t, \pi)) \to 0$.

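Lemma 6 (Generalized merging of opinions [Hut09]). For any measures $P$ and
$Q$ it holds that $P(\Omega^\circ \cup \bar{\Omega}) = 1$ where

\[ \Omega^\circ := \Big\{\omega : \frac{Q(\omega_{1:t})}{P(\omega_{1:t})} \to 0\Big\} \quad \text{and} \quad \bar{\Omega} := \{\omega : d(P(\cdot | \omega_{1:t}), Q(\cdot | \omega_{1:t})) \to 0\}. \]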

Lemma 7 (Value convergence for merging environments). Given a policy $\pi$ and
environments $\mu$ and $\nu$ it follows that

\[ |V^\pi_\mu(h_t) - V^\pi_\nu(h_t)| \le \frac{1}{1-\gamma} d(\mu(\cdot | h_t, \pi), \nu(\cdot | h_t, \pi)). \]

Proof. The lemma follows from the general inequality

\[ |\mathbb{E}_P(f) - \mathbb{E}_Q(f)| \le \sup |f| \cdot \sup_{A} |P(A) - Q(A)| \]

by inserting $f := R(\omega_{t:\infty})$ and $P = \mu(\cdot | h_t, \pi)$ and $Q = \nu(\cdot | h_t, \pi)$, and using
$0 \le f \le 1/(1-\gamma)$. ■
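
The inequality used in this proof can be checked numerically on a small finite example. The Python sketch below is only an illustration on a two-point outcome space (the measures in the text live on infinite sequences); it uses the standard fact that, for distributions on a finite set, $\sup_A |P(A) - Q(A)|$ equals half the $L_1$ distance. The names `total_variation` and `expectation` and the numbers are hypothetical.

```python
def total_variation(p, q):
    """sup_A |P(A) - Q(A)| for two distributions on the same finite set (half the L1 distance)."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def expectation(p, f):
    """Expected value of f under the finite distribution p."""
    return sum(p[x] * f[x] for x in p)


# Hypothetical example with a nonnegative bounded function f, as in the proof.
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.8, "b": 0.2}
f = {"a": 0.0, "b": 10.0}

gap = abs(expectation(P, f) - expectation(Q, f))  # |E_P f - E_Q f|
bound = max(f.values()) * total_variation(P, Q)   # sup|f| * d(P, Q)
```

In this example the bound is attained (up to rounding), because all of the mass that $P$ has in excess of $Q$ sits on the point where $f$ is largest and $f$ is zero elsewhere.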



The following lemma replaces the property for deterministic environments
that either they are consistent indefinitely or the probability of the generated
history becomes 0.

Lemma 8 (Merging of environments). Suppose we are given two environments
$\mu$ (the true one) and $\nu$ and a policy $\pi$ (defined e.g. by Algorithm 2). Let $P(\cdot) =
\mu(\cdot | \pi)$ and $Q(\cdot) = \nu(\cdot | \pi)$. Then with $P$ probability 1 we have that

\[ \lim_{t \to \infty} \frac{Q(\omega_{1:t})}{P(\omega_{1:t})} = 0 \quad \text{or} \quad \lim_{t \to \infty} |V^\pi_\mu(h_t) - V^\pi_\nu(h_t)| = 0. \]

Proof. This follows from a combination of Lemma 6 and Lemma 7. ■



The next lemma tells us what happens after all the environments that will
be removed have been removed but we state it as if this was time $t = 0$ for
notational simplicity.

Lemma 9 (Optimism is nearly optimal). Suppose that we have a (finite or
infinite) class of (possibly) stochastic environments $\mathcal{M}$ containing the true
environment $\mu$. Also suppose that none of these environments are excluded at any
time by Algorithm 2 ($\pi^\circ$) during an infinite history $h$ that has been generated
by running $\pi^\circ$ in $\mu$. Given $\varepsilon > 0$ there is $\tilde{\varepsilon} > 0$ such that

\[ V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi} V^\pi_\mu(h_t) - \varepsilon \quad \forall t \]

if

\[ |V^{\pi^\circ}_{\nu_1}(h_t) - V^{\pi^\circ}_{\nu_2}(h_t)| < \tilde{\varepsilon} \quad \forall t, \forall \nu_1, \nu_2 \in \mathcal{M}. \]

Proof. (Theorem 4) Given a policy $\pi$, let $P(\cdot) = \mu(\cdot | \pi)$ where $\mu \in \mathcal{M}$ is the
true environment and $Q = \nu(\cdot | \pi)$ where $\nu \in \mathcal{M}$. Let the outcome sequence (the
sequence $(o_1 r_1), (o_2 r_2), \dots$) be denoted by $\omega$. It follows from Doob's martingale
inequality [Doo53] that for all $z \in (0, 1)$

\[ P\Big(\sup_t \frac{Q(\omega_{1:t})}{P(\omega_{1:t})} \ge 1/z\Big) \le z, \quad \text{which implies} \quad P\Big(\inf_t \frac{P(\omega_{1:t})}{Q(\omega_{1:t})} \le z\Big) \le z. \]

This proves, using a union bound, that the probability of Algorithm 2 ever
excluding the true environment is at most $z(|\mathcal{M}| - 1)$.

The ratios $\frac{Q(\omega_{1:t})}{P(\omega_{1:t})}$ converge almost surely, as argued before using the
martingale convergence theorem. Lemma 8 tells us that any given environment (with
probability one) is eventually excluded or is permanently included and merges
with the true one under $\pi^\circ$. The remaining environments do, according to
(and in the sense of) Lemma 8, merge with the true environment. Lemma 7
tells us that the difference between value functions (for the same policy) of
merging environments converges to zero. Since there are finitely many environments
and the ones that remain indefinitely in $\mathcal{M}_t$ merge with the true environment
under $\pi^\circ$, there is for every $\tilde{\varepsilon} > 0$ a $T$ such that when following $\pi^\circ$, it holds for
all $t \ge T$ that

\[ |V^{\pi^\circ}_{\nu_1}(h_t) - V^{\pi^\circ}_{\nu_2}(h_t)| < \tilde{\varepsilon} \quad \forall \nu_1, \nu_2 \in \mathcal{M}_t. \]

The proof is concluded by Lemma 9 in the case where the true environment
remains indefinitely included, which happens with probability at least $1 - z(|\mathcal{M}| - 1)$. ■



4 Compact Classes 

In this section we discuss infinite but compact classes of stochastic
environments. First note that without further assumptions, asymptotic optimality can
be impossible to achieve, even for countably infinite deterministic environments
[LH11a]. Here we consider classes that are compact with respect to the total
variation distance, or more precisely with respect to

\[ d(\nu_1, \nu_2) = \max_{h, \pi} d(\nu_1(\cdot | h, \pi), \nu_2(\cdot | h, \pi)) \]

where $d$ is total variation distance from Section 3. An example is the class
of Markov Decision Processes (or POMDPs) with a certain number of states.
Algorithm 2 does need modification to achieve asymptotic optimality in the
compact case. An alternative to modifying the algorithm is to be satisfied
with reaching optimality within a pre-chosen $\varepsilon > 0$. This can be achieved
by first choosing a finite covering of $\mathcal{M}$ with balls of total variation radius
less than $\varepsilon(1 - \gamma)$ and then using Algorithm 2 with the centers of these balls. To
have an algorithm that for any $\varepsilon > 0$ eventually achieves optimality within $\varepsilon$
is a more demanding task. This is because we need to be able to say that
the true environment will remain indefinitely in the considered class with a
given confidence. For this purpose we introduce a confidence radius inspired by
MDP solving algorithms like MBIE [SL05] and UCRL [AO06]. We still use the
notation $\mathcal{M}_t$ as in Algorithm 2 and we define Algorithm 3 based on replacing
it with a larger $\tilde{\mathcal{M}}_t$. If we do not do this the true environment is likely to be
excluded.
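
The finite-covering idea mentioned above can be sketched with a simple greedy construction. The Python example below is a hypothetical illustration only: it works on a finite list of candidate parameters and uses the absolute difference of real numbers as a stand-in for the distance $d$, which in the text is a total variation distance between environments. The names `greedy_cover` and `cover_radius` are invented for this sketch.

```python
def cover_radius(epsilon, gamma):
    """Ball radius eps * (1 - gamma) used for the finite covering."""
    return epsilon * (1.0 - gamma)


def greedy_cover(points, dist, radius):
    """Pick centers so that every point is strictly within `radius` of some center."""
    centers = []
    for x in points:
        # Add x as a new center only if no existing center already covers it.
        if all(dist(x, c) >= radius for c in centers):
            centers.append(x)
    return centers


# Stand-in example: parameters on a grid in [0, 1] with |x - y| as the distance.
grid = [i / 10 for i in range(11)]
radius = cover_radius(epsilon=2.5, gamma=0.9)  # roughly 0.25
centers = greedy_cover(grid, lambda x, y: abs(x - y), radius)
```

Running the finite-class agent on the resulting centers only would then aim at optimality within the pre-chosen $\varepsilon$ rather than for every $\varepsilon$.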

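Definition 10 (Confidence radius). We denote all environments within $r^z_t$ from
$\mathcal{M}_t$ by

\[ \tilde{\mathcal{M}}_t := \{\nu \in \mathcal{M} \mid \exists \tilde{\nu} \in \mathcal{M}_t : d(\nu, \tilde{\nu}) < r^z_t\}. \]

Given $z > 0$ we say that $r^z_t(h_t)$ is a $p$-confidence radius sequence if $r^z_t(h_t) \to 0$
almost surely and if the true environment is in $\tilde{\mathcal{M}}_t$ for all $t$ with probability $p$.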

Definition 11 (Algorithm 3). Given a class of environments $\mathcal{M}$ that is compact
in the total variation distance we define Algorithm 3 as being Algorithm 2 with
$\mathcal{M}_t$ replaced by $\tilde{\mathcal{M}}_t$.

Definition 12 (Radon-Nikodym differentiable class). Suppose that the class $\mathcal{M}$
is such that if $\mu \in \mathcal{M}$ is the true environment, then for any policy $\pi$ it holds with
probability one that for all $\nu \in \mathcal{M}$, $X_{t,\nu} := \frac{\nu(h_t | \pi)}{\mu(h_t | \pi)}$ converges as $t \to \infty$ to some
random variables $X_\nu$. We call such a class Radon-Nikodym (RN) differentiable.
If the property holds with respect to a specific policy $\pi$ we say that the class is
RN-differentiable with respect to $\pi$.

Remark 13. Every countable class is RN-differentiable and so is the class of
MDPs with a certain number of states. The MBIE [SL05] and UCRL [AO06]
algorithms are based on the fact that one can define confidence radii for
MDPs, though their bounds need separate intervals for each state-action pair
depending on the number of visits. For an ergodic MDP all state-action pairs will
almost surely be seen infinitely often and the max length of those intervals will
tend to zero. Therefore, one can define a radius based on this maximum length
or, alternatively, one can easily allow Algorithm 3 to run with such rectangular
sets instead.

Theorem 14 (Optimality, Compact Stochastic Class). Suppose we use
Algorithm 3 with threshold $z \in (0, 1)$, a compact (in total variation) RN-differentiable
class (with respect to $\pi^\circ$ is enough) $\mathcal{M}$ of stochastic environments and a
$p$-confidence radius sequence for $\mathcal{M}$. Denote the resulting policy by $\pi^\circ$. If the
true environment $\mu$ is in $\mathcal{M}$, then with probability $p$ there is, for every $\varepsilon > 0$, a
time $T < \infty$ such that

\[ V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi} V^\pi_\mu(h_t) - \varepsilon \quad \forall t \ge T. \]

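Lemma 15 (Uniform exclusion). Let $Q_\nu(\cdot) = \nu(\cdot | \pi^\circ)$ and $P(\cdot) = \mu(\cdot | \pi^\circ)$ where
$\mu$ is the true environment and $\pi^\circ$ the policy defined by Algorithm 3. For any
outcome sequence $\omega$, let

\[ \mathcal{M}^\circ(\omega) := \Big\{\nu \in \mathcal{M} \;\Big|\; \frac{Q_\nu(\omega_{1:t})}{P(\omega_{1:t})} \to 0\Big\}. \]

For any closed subset of $\mathcal{M}^\circ(\omega)$ and for every $z > 0$, there is $T < \infty$ such that
for every $\nu$ in this subset there is $t \le T$ such that $\frac{Q_\nu(\omega_{1:t})}{P(\omega_{1:t})} < z$.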

Proof. Since $\mathcal{M}$ is compact and the subset in question is closed it follows that
it is also compact. Using the Arzelà-Ascoli Theorem [Rud76] we conclude that
there is a subsequence $t_k$ such that $Z^\nu_{t_k} := \min\{1, \frac{Q_\nu(\omega_{1:t_k})}{P(\omega_{1:t_k})}\}$ converges uniformly
to $0$ on $\mathcal{M}^\circ$ which means that there is $t_k$ such that $Z^\nu_{t_k} < z$ for all $\nu \in \mathcal{M}^\circ$ and
we can let $t = T = t_k$. ■



Proof. (Theorem 14) The strategy is to use that all environments that will be
excluded and do not lie within a certain distance of some environment that
merges with the true one, will be excluded after a certain finite time. Then we
can say that the remaining environments' value functions differ at most by a
certain amount and we can apply Lemma 9.

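We can with probability one say that for each $\nu \in \mathcal{M}$, it will hold that $Z^\nu_t = \frac{Q_\nu(\omega_{1:t})}{P(\omega_{1:t})}$
converges and each environment will be in $\mathcal{M}^\circ = \{\nu \in \mathcal{M} \mid Z^\nu_t \to 0\}$ or
$\bar{\mathcal{M}} = \{\nu \mid d(\nu(\cdot | h_t, \pi^\circ), \mu(\cdot | h_t, \pi^\circ)) \to 0\}$. $\bar{\mathcal{M}}$ is compact (in the total variation
distance topology) since it is a closed subset (again in the topology defined by
$d$) of the compact set $\mathcal{M}$.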

For any $\varepsilon_1 > 0$ we can do the following: For each $\nu \in \mathcal{M}$, consider a
total variation ball of radius $2\delta$ where $\delta = (1 - \gamma)\varepsilon_1/4$. Note that $|V^\pi_\nu(h_t) -
V^\pi_{\nu'}(h_t)| < \varepsilon_1/2$ for all $t$ whenever $d(\nu, \nu') < 2\delta$. The collection of these balls
induces an open cover of the compact set $\mathcal{M}$ and it follows that there is a finite
subcover. Consider the balls in this finite cover that intersect with $\bar{\mathcal{M}}$. Let $A$
be the union of these finitely many open balls. Let $B = \mathcal{M} \setminus A$. $B$ is then a
closed subset of $\mathcal{M}^\circ$. We want to say that there is a finite time after which all
environments in $B$ will have been excluded from $\tilde{\mathcal{M}}_t$. This happens if $\bar{B}$, defined
as the union of the closed balls of radius $\delta$ at every point in $B$, has been excluded
from $\mathcal{M}_t$. If $t$ is large enough for $r^z_t < \delta$, then $\bar{B}$ is also a closed subset of $\mathcal{M}^\circ$.
Lemma 15 tells us that all of the environments in $\bar{B}$ will have been excluded
from $\mathcal{M}_t$ after a finite amount of time $T_1$ and, therefore, all the environments
in $B$ will have been excluded from $\tilde{\mathcal{M}}_t$. Thus $\tilde{\mathcal{M}}_t \subset A$ $\forall t \ge T_1$ and in particular
the optimistic hypothesis $\nu^*$ will be in $A$ when $t \ge T_1$. Let $\nu^* (= \nu^*_t)$ be the
optimistic hypothesis at time $t \ge T_1$ and $\pi^* (= \pi^*_t)$ the optimistic policy.

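Each parameter in $A$ (and in particular $\nu^*$) lies within $\delta$ of a ball with
center $\nu$ which lies within $\delta$ of a point $\bar{\nu} \in \bar{\mathcal{M}}$. Hence $d(\nu^*, \bar{\nu}) < 2\delta$ and
$|V^{\pi^*}_{\nu^*}(h_t) - V^{\pi^*}_{\bar{\nu}}(h_t)| < \varepsilon_1/2$.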

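Due to the uniform merging of environments (under $\pi^\circ$) on $\bar{\mathcal{M}}$, there is
$T_2 \ge T_1$ such that $|V^{\pi^\circ}_{\nu_1}(h_t) - V^{\pi^\circ}_{\nu_2}(h_t)| < \varepsilon_1/2$ $\forall \nu_1, \nu_2 \in \bar{\mathcal{M}}$ $\forall t \ge T_2$. We
conclude that $|V^{\pi^\circ}_{\nu_1}(h_t) - V^{\pi^\circ}_{\nu_2}(h_t)| < \varepsilon_1$ $\forall \nu_1, \nu_2 \in A$ $\forall t \ge T_2$ and since $\tilde{\mathcal{M}}_t \subset A$

\[ |V^{\pi^\circ}_{\nu_1}(h_t) - V^{\pi^\circ}_{\nu_2}(h_t)| < \varepsilon_1 \quad \forall \nu_1, \nu_2 \in \tilde{\mathcal{M}}_t\ \forall t \ge T_2. \]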

From Lemma 9 we know that if we picked $\varepsilon_1$ small enough, then for
$t \ge T_2$, $V^{\pi^\circ}_\nu(h_t) \ge V^\pi_\nu(h_t) - \varepsilon/2$ for all $\pi \in \Pi$, $\nu \in \tilde{\mathcal{M}}_t$. Furthermore, by
picking $\varepsilon_1$ sufficiently small we can, for $t \ge T_2$, ensure that there is $\nu \in \tilde{\mathcal{M}}_t$
such that $|V^\pi_\nu(h_t) - V^\pi_\mu(h_t)| < \varepsilon/2$. Given that the true environment remains
indefinitely in $\tilde{\mathcal{M}}_t$, which happens with at least probability $p$, it follows that

\[ V^{\pi^\circ}_\mu(h_t) \ge \max_{\pi} V^\pi_\mu(h_t) - \varepsilon \quad \forall t \ge T_2. \quad ■ \]

5 Conclusions 

We introduced optimistic agents for finite and compact classes of arbitrary en- 
vironments and proved asymptotic optimality. In the deterministic case we also 
bound the number of time steps for which the value of following the algorithm 
is more than a certain amount lower than optimal. Future work includes inves- 
tigating finite-error bounds for classes of stochastic environments. 